Abstract


Most of the focus in Linux processor power management
today has been on managing the processor while it is
active: cpufreq changes the processor frequency and/or
voltage, and manages processor performance levels and
power consumption based on processor load. Another
dimension of processor power management is processor
‘idling’ power.


Almost all mobile processors in the marketplace today
support multiple processor idle states, with varying
amounts of power consumed in those idle states. Each such
state has an entry-exit latency associated with it. In
general, a lot of attention is shifting towards idle
platform power, and new platforms/processors support
multiple idle states with different power and wakeup
latency characteristics. This emphasis on idle power,
together with different processors supporting different
numbers of idle states and different ways of entering
them, calls for a generic Linux kernel framework to
manage idle processors.


This paper covers cpuidle, an effort towards a generic
processor idle management framework in the Linux kernel.
The goal is to have a clean interface for any processor
hardware to make use of different processor idle levels,
and also to provide an abstraction between idle drivers
and idle governors, allowing independent development of
drivers and governors. The target audience is developers
who are keen to experiment with new idle governors on top
of cpuidle, developers who want to use the cpuidle driver
infrastructure in various architectures, and anyone else
who is keen to know about cpuidle.


1 Introduction 


Almost all the mobile processors today support multiple
idle states, and the trend is spreading as processor power
management and system power management gain importance for
a variety of reasons.


In typical system usage models, processors spend a lot
of their time idling (for example, while you are reading
this paper on your laptop with your favorite PDF reader).
Thus any power saved when the system is idle will have big
returns in terms of battery life, heat generated in the
system, need for cooling, etc.


But there is a trade-off between idling power, the
amount of state a processor saves, and the amount of time
it takes to enter and exit the idle state. The idle
enter-exit latency, if it is too high, may be visible
with media applications like a DVD player. Such usage
models will limit the use of a particular idle state on
the processor running the application, even though the
idle state is power efficient. Similarly, if a processor
idle state does not preserve the contents of the
processor’s cache, an application that has some idle time
may see a performance degradation when that idle state is
used.


In order to manage this trade-off effectively, the kernel
needs to know the characteristics of all idle states,
should understand the currently running applications, and
should make a well-informed decision about which idle
state to enter when a processor goes idle.


To do this effectively and cleanly, there is a preliminary 
requirement of having clean and simple interfaces. Such 
an interface can provide consistent information to the 
user and ease the innovation and development in the area 
of processor idle management. 


cpuidle is an effort in this direction, and this paper
provides insight into cpuidle. Section 2 gives background
on processor power management and idle states. Section 3
provides the design description of cpuidle. Section 4
talks about the developments and advancements happening
in cpuidle, and Section 5 offers some conclusions.


2 Background 
2.1 Processor Power management 


Processor power management can be broadly classified 
into two classes. 


Processor active — various states a processor can be in 
while actively executing and retiring instructions. 
Processor frequency scaling, in which a processor
can run at different frequencies and/or voltages,
falls under this class. So does processor thermal
throttling, where the processor runs slower due to
duty cycle throttling.


Linux cpufreq, extensively discussed in [4], [6],
and [5], is a generic infrastructure that handles 
CPU frequency scaling. 


Processor idle — various states a processor can be in 
while it is idle and not retiring any instructions. 
The states here differ in amount of power the pro- 
cessor consumes while being in that state and also 
the latency to enter-exit this low-power idle state. 
There may also be other differences like preserv- 
ing the processor state across these idle states, etc. 
based on a specific processor. For example, a pro- 
cessor may only flush L1 cache in one idle state, 
but may flush L1 and L2 caches in another idle 
state. There can also be differences around when 
an idle state can be entered and what its impact 
will be on other logical or physical processors in 
the system. 


2.2 Processor idle states 


Currently, most of the processors in the mobile and
handheld segments support multiple idle states. The prime
objective here is to provide a more power-efficient
system with longer battery life or fewer cooling
requirements. This feature is slowly moving up the chain
into desktops and servers, much as processor frequency
scaling was mostly present in mobile processors a few
years back and is supported by most servers today.
Recent EnergyStar idle power regulations [2] tend to
accelerate this trend, making the feature more common
across a range of systems.


2.3 Current Processor idle state support 


Below is a short summary of current processor idle state 
management in Linux 2.6.21 [3]. 


ACPI based idle states For the remainder of this section
we restrict our attention to idle state support in the
i386 (and x86-64) architectures.


In the i386 (and x86-64) architectures, there is support
for ACPI-based [1] processor idle states. These states
are referred to as C-states in ACPI terminology. Each
ACPI C-state is characterised by its power consumption,
its wakeup latency, and the extent to which processor
state is preserved while in that C-state. ACPI-based
platforms report processor idle capability to Linux using
ACPI interfaces. A platform can dynamically change the
number of C-states supported, based on platform
parameters such as whether it is running on battery or
AC power.


The current Linux support for such idle states is
fully embedded in the drivers/acpi directory
along with all ACPI support code. Code here detects
the C-states available at boot time, handles any
changes to the number of C-states at run time, and has
a simplistic policy for choosing which C-state to enter
whenever a CPU goes idle. This code includes various
platform-specific bits, specific workarounds for platform
ACPI bugs, and also a /proc-based interface exporting the
C-state-related information to userspace.


Arch specific idle—i386 and x86-64 i386 and x86-64 
(and also ia64) have some architecture-specific 
processor idle management that does not depend on 
ACPI. On i386 and x86-64, it includes support for
poll_idle, halt_idle, and mwait_idle.
poll_idle is a polling-based idle loop, which is not
really power efficient, but has very little wakeup
overhead. halt_idle is based on the x86 hlt instruction,
and mwait_idle is based on the monitor/mwait pair of
instructions. There are specific static rules regarding
which of these idle routines will be used on any system,
based on boot options and hardware capabilities. Further,
the boot options for these three idle routines are not
the same across i386 and x86-64.

















Arch-specific idle—other architectures There are 
various other architectures that have their own 
code for processor idle state management. This
includes ia64 with PAL halt and PAL light halt, 
Power with nap and doze modes, and idle support 
for different platforms in the ARM architecture. 
Each of these types of support for idle states also 
comes with its own set of boot parameters and/or 
/proc or /sys interfaces to user-space. 


Bottom line There is very little sharing of code or of
idle management policies across architectures. Processor
idle state management, the various boot options, etc.,
are duplicated; this results in maintenance overhead.
This, as well as the increasing focus on processor idle
power in platforms, highlights the need for a generic
processor idle framework in the Linux kernel.


3 Basic cpuidle infrastructure 


Figure 1 gives a high-level overview of the cpuidle 
architecture. The basic idea behind cpuidle is to sep- 
arate the idle state management policies from hardware- 
specific idle state drivers. At this level, the cpuidle 
model has similarities with cpufreq [6]. 


3.1 cpuidle core 


The cpuidle core provides a set of generic interfaces 
to manage processor idle features. 


3.1.1 cpuidle data structures


A per-cpu cpuidle_device structure holds information
about the number of idle states supported by each
processor, information about each of those idle states
(in an array of struct cpuidle_state), and the status of
this device, among other things.


cpuidle_state is a structure that contains information
about an individual state: its power usage, exit latency,
usage statistics, etc.


cpuidle core maintains separate linked lists of all reg- 
istered drivers, all registered governors, and all detected 
devices. 


cpuidle_lock is the lone mutex that handles all 
SMP orderings within cpuidle. 


3.1.2 Initialization and Registration 


Drivers can register and unregister with the cpuidle core
using cpuidle_register_driver and
cpuidle_unregister_driver. Governors can register and
unregister using cpuidle_register_governor and
cpuidle_unregister_governor. Each CPU device is detected
on the CPU add_device callback of cpu_sysdev. If there is
a currently active governor and active driver, then the
device is initialized with that governor and driver.
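The registration step can be pictured with a small userspace
sketch. This is only an illustration under stated assumptions:
the toy_ names, the singly linked list, and the duplicate-name
check are hypothetical and are not taken from the cpuidle
sources, and the locking that cpuidle_lock provides in the real
core is omitted.

```c
#include <stddef.h>
#include <string.h>

#define TOY_NAME_LEN 16

/* Hypothetical, simplified stand-in for a registered driver. */
struct toy_driver {
    char name[TOY_NAME_LEN];
    struct toy_driver *next;
};

/* Head of the list of registered drivers (no locking in this sketch). */
static struct toy_driver *toy_drivers;

/* Look up a registered driver by name; returns NULL if not found. */
static struct toy_driver *toy_find_driver(const char *name)
{
    struct toy_driver *d;

    for (d = toy_drivers; d; d = d->next)
        if (strncmp(d->name, name, TOY_NAME_LEN) == 0)
            return d;
    return NULL;
}

/* Add a driver to the list; returns 0 on success, -1 on NULL or duplicate. */
static int toy_register_driver(struct toy_driver *drv)
{
    if (!drv || toy_find_driver(drv->name))
        return -1;
    drv->next = toy_drivers;
    toy_drivers = drv;
    return 0;
}

/* Remove a driver from the list; does nothing if it is not registered. */
static void toy_unregister_driver(struct toy_driver *drv)
{
    struct toy_driver **pp;

    for (pp = &toy_drivers; *pp; pp = &(*pp)->next) {
        if (*pp == drv) {
            *pp = drv->next;
            drv->next = NULL;
            return;
        }
    }
}
```

A governor list could be kept the same way; the sketch keeps
only the bookkeeping and leaves out device initialization.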





3.1.3 Idle handling 


The cpuidle core has an idle handler,
cpuidle_idle_call(), that gets plugged into the
architecture-independent pm_idle function pointer, which
is used by each individual processor when it goes idle.
Just before going idle, the governor selects the best
idle state to enter, and cpuidle then invokes the entry
point for that particular state in the cpuidle driver. On
returning from that state, there is an optional governor
callback for the governor to capture information about
idle state residency.
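The select, enter, reflect sequence described above can be
sketched in userspace C. This is a minimal sketch, not the
kernel implementation: the toy_ structures, the fallback to
state 0, and the fake residency value are assumptions made for
illustration.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct toy_state {
    const char *name;
    unsigned int usage;                      /* number of times entered */
    int (*enter)(struct toy_state *state);   /* returns residency in us */
};

struct toy_device {
    struct toy_state *states;
    int state_count;
    int last_residency;                      /* us spent in the last idle period */
};

struct toy_governor {
    int  (*select)(struct toy_device *dev);  /* index of the state to enter */
    void (*reflect)(struct toy_device *dev); /* optional, may be NULL */
};

/* Sketch of an idle handler: governor selects, driver enters, governor reflects. */
static int toy_idle_call(struct toy_device *dev, struct toy_governor *gov)
{
    int idx = gov->select(dev);
    struct toy_state *target;

    if (idx < 0 || idx >= dev->state_count)
        idx = 0;                             /* assumed fallback: shallowest state */
    target = &dev->states[idx];

    dev->last_residency = target->enter(target);
    target->usage++;

    if (gov->reflect)
        gov->reflect(dev);
    return idx;
}

/* Toy callbacks used only to exercise the flow. */
static int toy_enter_fake(struct toy_state *state)
{
    (void)state;
    return 100;                              /* pretend the CPU idled for 100 us */
}

static int toy_select_deepest(struct toy_device *dev)
{
    return dev->state_count - 1;
}
```

Passing a governor whose reflect pointer is NULL exercises the
"optional callback" case mentioned in the text.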


3.1.4 Handling system state change 


The number and type of idle states can vary dynamically
based on the system state, such as whether the system is
battery- or AC-powered. Such a system state change
notification goes to the idle driver, which invokes
cpuidle_force_redetect() in the cpuidle core. This
results in the idle handler being temporarily uninstalled
and the idle states being re-detected by the driver,
followed by re-initialization of the governor state to
take note of this change.
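The ordering of the re-detect steps can be sketched as follows.
This is an assumption-laden userspace illustration: the toy_
names and the way the new state count is passed in are
hypothetical, and the sketch does not wait for other CPUs to
leave the old idle handler as a real implementation would need
to.

```c
#include <stddef.h>

/* Hypothetical device: state count seen by driver and by governor. */
struct toy_dev {
    int state_count;      /* states currently exposed by the driver */
    int gov_seen_count;   /* states the governor last took note of */
};

/* Stand-in for the pm_idle function pointer. */
static void (*toy_pm_idle)(void);

static void toy_idle_handler(void)
{
    /* A real handler would select and enter an idle state here. */
}

/* Hypothetical driver redetect: the platform now exposes new_count states. */
static int toy_redetect(struct toy_dev *dev, int new_count)
{
    if (new_count < 1)
        return -1;
    dev->state_count = new_count;
    return 0;
}

/* Hypothetical governor scan: take note of the changed states. */
static void toy_scan(struct toy_dev *dev)
{
    dev->gov_seen_count = dev->state_count;
}

/* Sketch of the sequence: uninstall, redetect, rescan, reinstall. */
static int toy_force_redetect(struct toy_dev *dev, int new_count)
{
    int ret;

    toy_pm_idle = NULL;                  /* 1. temporarily uninstall the handler */
    ret = toy_redetect(dev, new_count);  /* 2. driver re-detects its states */
    if (ret == 0)
        toy_scan(dev);                   /* 3. governor notes the change */
    toy_pm_idle = toy_idle_handler;      /* 4. reinstall the handler */
    return ret;
}
```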


3.2 Design guidelines 


There were a few conscious design decisions/trade-offs in
cpuidle.


3.2.1 cpu_idle_wait


To make sure we do not take a lock during the normal
idle routine entry-exit, and to be able to safely change
the governor/driver at run time, cpu_idle_wait was
used. Note that changing drivers/governors is an
uncommon event that is not performance-sensitive.


[Figure 1: cpuidle overview. The figure shows the
user-level interfaces
(/sys/devices/system/cpu/cpuidle and
/sys/devices/system/cpu/cpuX/cpuidle) on top of the
generic cpuidle infrastructure, which connects governors
(ladder) and drivers (acpi-cpuidle, halt_idle); below the
drivers are the ACPI processor driver and the
arch/platform specific drivers.]


3.2.2 system-level governor and driver 

Should cpuidle support a single driver and single
governor for the whole system, or should they be per-cpu?
Considering the simplicity of a system-level governor and
driver compared with per-cpu governors and drivers, it was
decided to have a single system-level governor and
driver.


3.2.3 No cpu_hotplug_lock in cpuidle 


Learning from the experiences of cpufreq with
cpu_hotplug_lock, cpuidle avoids using cpu_hotplug_lock
in the entire subsystem. This in fact resulted in a
cleaner, self-contained SMP and hotplug synchronization
model for cpuidle.

















3.2.4 Runtime governor/driver switching 


Even though runtime switching of the governor and driver
can potentially be misused by end users, cpuidle supports
runtime switching of the governor or driver, mostly to
help developers and testers of cpuidle. In the future,
this switching of driver and governor can be disabled by
default, in order to avoid incorrect usage.


3.3 driver interface 


The cpuidle_register_driver uses a structure 
that defines the cpuidle driver interface: 


struct cpuidle_driver {
    char              name[CPUIDLE_NAME_LEN];
    struct list_head  driver_list;

    int  (*init)     (struct cpuidle_device *dev);
    void (*exit)     (struct cpuidle_device *dev);
    int  (*redetect) (struct cpuidle_device *dev);
    int  (*bm_check) (void);

    struct module     *owner;
};


init() is a callback, called by cpuidle to initialize
each device in the system with this specific driver.
exit() is called to exit this particular driver for each
device. The redetect() callback is used to re-detect
the device states on certain system state changes.
bm_check() is used to note the bus mastering status
on the device. In init(), the driver has to initialize all
the states for the particular device and set the total
state count for that device.








struct cpuidle_state {
    char          name[CPUIDLE_NAME_LEN];
    void          *driver_data;

    unsigned int  flags;
    unsigned int  exit_latency;      /* in US */
    unsigned int  power_usage;       /* in mW */
    unsigned int  target_residency;  /* in US */

    unsigned int  usage;
    unsigned int  time;              /* in US */

    int (*enter) (struct cpuidle_device *dev,
                  struct cpuidle_state *state);

    struct kobject kobj;
};


enter() is the callback used to actually enter this idle
state. exit_latency and power_usage are characteristics
of the idle state. flags denotes generic capabilities,
features, and bugs of the idle state. usage is the count
of times this idle state has been invoked, and time is
the time spent in this state.


cpuidle_register_driver() and
cpuidle_unregister_driver() are used to register and
unregister (respectively) a driver with cpuidle.
cpuidle_force_redetect() is used by the driver
to force the cpuidle core to re-detect all the device
states (e.g., after a system state change).
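As an illustration of what a driver's init() has to do, the
following userspace sketch fills in a small state table. It is
a sketch under stated assumptions: the toy_ structures keep only
a few of the cpuidle_state fields, TOY_STATE_MAX is hypothetical,
and the latency, power, and residency numbers are made up for
the example rather than measured on any hardware.

```c
#include <string.h>

#define TOY_NAME_LEN  16
#define TOY_STATE_MAX 8   /* hypothetical upper bound on states per device */

/* Simplified copy of a few cpuidle_state fields. */
struct toy_state {
    char name[TOY_NAME_LEN];
    unsigned int exit_latency;      /* in us */
    unsigned int power_usage;       /* in mW */
    unsigned int target_residency;  /* in us */
};

struct toy_device {
    int state_count;
    struct toy_state states[TOY_STATE_MAX];
};

/* Append one state to the device table; silently ignores overflow. */
static void toy_add_state(struct toy_device *dev, const char *name,
                          unsigned int latency_us, unsigned int power_mw,
                          unsigned int residency_us)
{
    struct toy_state *s;

    if (dev->state_count >= TOY_STATE_MAX)
        return;
    s = &dev->states[dev->state_count++];
    strncpy(s->name, name, TOY_NAME_LEN - 1);
    s->name[TOY_NAME_LEN - 1] = '\0';
    s->exit_latency = latency_us;
    s->power_usage = power_mw;
    s->target_residency = residency_us;
}

/* Sketch of a driver init(): describe every state and set the count. */
static int toy_driver_init(struct toy_device *dev)
{
    dev->state_count = 0;
    toy_add_state(dev, "C1", 1, 1000, 1);     /* illustrative values only */
    toy_add_state(dev, "C2", 20, 500, 40);
    toy_add_state(dev, "C3", 100, 100, 300);
    return 0;
}
```

Ordering the table from shallowest to deepest state is a choice
made in this sketch; it keeps governor logic that walks the
table simple.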


3.4 governor interface 


struct cpuidle_governor {
    char              name[CPUIDLE_NAME_LEN];
    struct list_head  governor_list;

    int  (*init)    (struct cpuidle_device *dev);
    void (*exit)    (struct cpuidle_device *dev);
    void (*scan)    (struct cpuidle_device *dev);
    int  (*select)  (struct cpuidle_device *dev);
    void (*reflect) (struct cpuidle_device *dev);

    struct module     *owner;
};


init() is a callback, called by cpuidle, to initialize
each governor with a specific device. exit() is called
to exit this governor for a device.


scan() is called on a re-detect of the states in the
device. This provides an opportunity for the governor to
note the changes in states during a driver re-detect.


select() is called before each idle entry by a device,
for the governor to make a state selection for the idle
call. reflect() is called after an idle exit, for the
governor to capture information about idle state
residency. Note that time spent in the governor’s
reflect() is in the critical path (on exit from idle,
before starting the work) and hence has to be fast.


cpuidle_register_governor() and
cpuidle_unregister_governor() are used to register and
unregister (respectively) a governor with cpuidle.
cpuidle_get_bm_activity() gets information about
bus-master (bm) activity, which the governor can use
in its select routine.
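A governor's select() and reflect() pair might look roughly
like the sketch below. Everything here is hypothetical:
toy_get_bm_activity() only stands in for
cpuidle_get_bm_activity(), whose real signature is not given in
this paper, and the policy of backing off one state while bus
mastering is active is an assumption chosen to keep the example
short.

```c
/* Hypothetical per-device data a governor might keep. */
struct toy_device {
    int state_count;
    int last_state;                    /* index chosen by the last select */
    unsigned int last_residency_us;    /* filled in after idle exit */
    unsigned long total_residency_us;  /* governor bookkeeping */
};

/* Stand-in for a bus-master activity query; set by the caller here. */
static int toy_bm_active;

static int toy_get_bm_activity(void)
{
    return toy_bm_active;
}

/* select(): pick the deepest state, one step shallower if bm is active. */
static int toy_gov_select(struct toy_device *dev)
{
    int target = dev->state_count - 1;

    if (target > 0 && toy_get_bm_activity())
        target--;                      /* assumed policy, not from the paper */
    dev->last_state = target;
    return target;
}

/* reflect(): constant-time bookkeeping, since it runs on the idle-exit path. */
static void toy_gov_reflect(struct toy_device *dev)
{
    dev->total_residency_us += dev->last_residency_us;
}
```

Keeping reflect() to a single addition mirrors the requirement
above that it stay fast.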





3.5 Userspace interface 


cpuidle userspace interfaces are split at the following 
two places in /sys. 


3.5.1 System-generic information 


This information is under
/sys/devices/system/cpu/cpuidle/.


available_drivers is a read-only interface that
lists all the drivers that have successfully registered
with cpuidle.


current_driver is a read-write interface that
contains the currently active cpuidle driver. By writing
a new value to this interface, the idle driver can
be changed at run time.


available_governors is a read-only interface
that lists all the governors that have successfully
registered with cpuidle.


current_governor is a read-write interface that
contains the currently active cpuidle governor. By
writing a new value to this interface, the idle
governor can be changed at run time.


Note that there can be only a single governor and a
single driver for all processors in the system.


3.5.2 Per-cpu information 


This information is under
/sys/devices/system/cpu/cpuX/cpuidle/ where X=0,1,2,....
For each idle state Y supported by the current driver,
the following read-only information can be seen under
sysfs.


stateY/usage: Shows the number of times this idle
state has been entered since the last driver init or
redetect.


stateY/time: Shows the amount of time spent in this
idle state, in uS.


stateY/latency: Shows the wakeup latency for this
state.


stateY/power: Shows the typical power consumed when
the CPU is in this state, in mW.
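The per-cpu attributes listed above can be read from userspace
with ordinary file I/O. The sketch below assumes the path layout
given in this section and that each attribute holds a single
decimal number; the helper names are hypothetical, and the
reader simply reports failure when a file is missing (for
example, on a kernel without cpuidle).

```c
#include <stdio.h>

/*
 * Build "/sys/devices/system/cpu/cpuX/cpuidle/stateY/<attr>".
 * Returns the snprintf() result so the caller can detect truncation.
 */
static int cpuidle_attr_path(char *buf, size_t len, int cpu, int state,
                             const char *attr)
{
    return snprintf(buf, len,
                    "/sys/devices/system/cpu/cpu%d/cpuidle/state%d/%s",
                    cpu, state, attr);
}

/* Read one numeric attribute; returns 0 on success, -1 if unavailable. */
static int cpuidle_read_attr(int cpu, int state, const char *attr,
                             unsigned long long *val)
{
    char path[128];
    FILE *f;
    int ok;
    int n = cpuidle_attr_path(path, sizeof(path), cpu, state, attr);

    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;                   /* path did not fit in the buffer */
    f = fopen(path, "r");
    if (!f)
        return -1;                   /* attribute not present on this system */
    ok = (fscanf(f, "%llu", val) == 1);
    fclose(f);
    return ok ? 0 : -1;
}
```

For example, cpuidle_read_attr(0, 1, "usage", &count) would try
to read the entry count of state1 on cpu0 and return -1 if that
file does not exist.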


3.6 Configuring and using cpuidle 
To configure cpuidle, select: 


Main Kernel Config 
Power management options (ACPI, APM) ---> 
CPU idle PM support ---> 
[ ] CPU idle PM support 


Once CPU idle PM support is selected, there will be
further options for the various governors supported in
the kernel, which can then be selected.


<*> 'ladder' governor (NEW)
<*> 'menu' governor (NEW)


Currently cpuidle is supported only on i386 and x86- 
64, with an ACPI-based idle driver. 


4 cpuidle advancements 


The current cpuidle changes are the beginning of 
things to come. There are a few things under develop- 
ment and discussion. 


4.1 New governors 


The ladder governor takes a step-wise approach to
selecting an idle state. Although this works fine with
periodic tick-based kernels, the step-wise model does not
work very well with tickless kernels. The kernel can go
idle for a long time without a periodic timer tick, and
it may not get a chance to step down the ladder to the
deepest idle state when it goes idle.
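The step-wise behaviour can be sketched as below. This is a
deliberately simplified model, not the ladder governor's actual
code: the single promote and demote thresholds and the toy_
names are assumptions. It shows why several idle periods are
needed before the deepest state is reached.

```c
/* Hypothetical ladder bookkeeping for one CPU. */
struct toy_ladder {
    int state_count;
    int current;               /* index of the state used last time */
    unsigned int promote_us;   /* residency at or above this: go one deeper */
    unsigned int demote_us;    /* residency below this: go one shallower */
};

/*
 * Move at most one rung per idle entry, based on how long the
 * previous idle period lasted. Returns the state index to enter.
 */
static int toy_ladder_select(struct toy_ladder *l,
                             unsigned int last_residency_us)
{
    if (last_residency_us >= l->promote_us &&
        l->current < l->state_count - 1)
        l->current++;
    else if (last_residency_us < l->demote_us && l->current > 0)
        l->current--;
    return l->current;
}
```

With three states and a start at state 0, two long idle periods
in a row are needed to reach state 2; on a tickless kernel a
single very long idle period would be spent entirely in the
state chosen at its start.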


A new idle governor to handle this, called the menu
governor, is being worked on. The menu governor looks
at different parameters, such as the expected sleep time
(as seen by dyntick), latency requirements, previous
C-state residency, the max_cstate requirement, and bm
activity, and then picks the deepest possible idle
state straight away. This governor aims at getting the
maximum possible power advantage with little impact on
performance.
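A direct selection of this kind can be sketched as follows.
This is not the menu governor's algorithm; it is a minimal
illustration that uses only two of the inputs named above
(expected sleep time and a latency requirement) and assumes the
states are ordered from shallowest to deepest with
non-decreasing exit latency and target residency. All names are
hypothetical.

```c
/* Simplified copy of the two cpuidle_state fields used here. */
struct toy_state {
    unsigned int exit_latency;      /* in us */
    unsigned int target_residency;  /* in us */
};

/*
 * Return the index of the deepest state whose target residency
 * fits in the expected sleep time and whose exit latency meets
 * the latency requirement. State 0 is always allowed.
 */
static int toy_menu_select(const struct toy_state *states, int count,
                           unsigned int expected_us,
                           unsigned int latency_req_us)
{
    int i, best = 0;

    for (i = 1; i < count; i++) {
        if (states[i].target_residency > expected_us)
            break;                  /* would not stay idle long enough */
        if (states[i].exit_latency > latency_req_us)
            break;                  /* wakeup would be too slow */
        best = i;
    }
    return best;
}
```

Unlike the step-wise sketch, one call is enough to reach the
deepest state when a long sleep is expected.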


4.2 Power data 


Power/Performance data with various idle policies will 
be provided at the time of presentation of this paper. 


4.3 Future Work 


Below are some of the items from the cpuidle to-do
list. The list is not exhaustive. Specifically, if you
don't find your favorite architecture mentioned here and
you would like to use cpuidle on your architecture, let
the authors of this paper know about it.


Today, CPU logical offline does not take the CPU to its
deepest idle state. There are thoughts about using
cpuidle to enter the deepest idle state when a CPU
is logically offlined.





cpuidle needs to be more flexible with regard to the
different non-ACPI-based idle drivers supported, and also
support run-time switching across these drivers.


Make cpuidle simple by default, and make it use the 
right driver and right governor for a platform by using a 
rating scheme for drivers and governors. This will avoid 
all the issues with users/distributions needing to config- 
ure cpuidle at every boot. 


Experiment with different governors to find the most 
power/performance efficient governor for specific plat- 
forms. This will be an ongoing exercise as more plat- 
forms support multiple idle states and use the cpuidle 
infrastructure. 
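The rating scheme mentioned in the to-do list above could be as
simple as choosing the registered entry with the highest
rating. The sketch below is purely hypothetical: the rating
field, the example values, and the tie-breaking rule (first
registered wins) are assumptions, since the paper does not
define the scheme.

```c
#include <stddef.h>

/* Hypothetical driver descriptor carrying a rating. */
struct toy_driver {
    const char *name;
    unsigned int rating;   /* higher means preferred on this platform */
};

/*
 * Return the highest-rated driver, or NULL if none are registered.
 * On a tie, the earlier entry is kept.
 */
static const struct toy_driver *toy_pick_best(const struct toy_driver *drv,
                                              size_t n)
{
    const struct toy_driver *best = NULL;
    size_t i;

    for (i = 0; i < n; i++)
        if (!best || drv[i].rating > best->rating)
            best = &drv[i];
    return best;
}
```

The same selection could be applied to governors, so that a
default configuration needs no input from the user.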


5 Conclusion 


The authors hope that the cpuidle infrastructure enables
Linux to have a platform-independent, generic
infrastructure for processor idle management. Such an
infrastructure will simplify support of idle states on
specific hardware by making it possible to write a simple
plug-in driver. Additionally, such an infrastructure will
simplify the writing of idle governors, and hopefully
will increase experimentation and innovation in idle
governors, similar to the frequency governors that
resulted from the cpufreq infrastructure.